There are different places to start with this project. The first thing I want to do is extract the number of stores I will be forecasting reservations for, since the given data covers more restaurants than the ones that actually need visitors forecasted. The sample submission includes the date at the end of the id, so first I need to remove that and then count the number of unique stores.
```r
forecast_stores <- str_sub(sample_submission$id, 1, 20) %>%
  unique()

air_resv_unique <- unique(air_reserve$air_store_id)
hpg_resv_unique <- unique(hpg_reserve$hpg_store_id)
air_rel_unique <- unique(store_id_relation$air_store_id)
air_visit_unique <- unique(air_visit_data$air_store_id)
```

We will need to forecast visits for 821 unique stores. There are 314 unique store IDs in the air reservations dataset and 13,325 unique store IDs in the hpg reservations dataset. Despite the large amount of hpg reservation information, the relational table links only 150 of those to air IDs. Lastly, the actual visitor data has 829 unique store IDs. What does all of this mean?
Well, this means that not all stores take reservations. So when I make my forecasts I will only be able to use reservation data for some of the stores. Why is it the case that not all stores take reservations?
We have information on all of the stores: restaurant genre, name, and location. Below is a summary by genre for the air dataset.
```r
# Genre breakdown
air_store_info %>%
  select(air_genre_name) %>%
  group_by(air_genre_name) %>%
  summarise(Count = n()) %>%
  arrange(desc(Count)) %>%
  kable(caption = 'air Store Genres',
        col.names = c('Genre Name', 'Count')) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = F)
```

| Genre Name | Count |
|---|---|
| Izakaya | 197 |
| Cafe/Sweets | 181 |
| Dining bar | 108 |
| Italian/French | 102 |
| Bar/Cocktail | 79 |
| Japanese food | 63 |
| Other | 27 |
| Yakiniku/Korean food | 23 |
| Western food | 16 |
| Okonomiyaki/Monja/Teppanyaki | 14 |
| Creative cuisine | 13 |
| Asian | 2 |
| International cuisine | 2 |
| Karaoke/Party | 2 |
Some of the stores are bars (including the Izakaya) and some are cafes; we wouldn't expect these places to take reservations, and they make up a sizable portion of the dataset. For the places that don't take reservations we will have to rely on more stochastic methods, while the places with reservations can use the additional information.
The hpg store genre breakdown is given below.
```r
hpg_store_info %>%
  select(hpg_genre_name) %>%
  group_by(hpg_genre_name) %>%
  summarise(Count = n()) %>%
  arrange(desc(Count)) %>%
  kable(caption = 'hpg Store Genres',
        col.names = c('Genre Name', 'Count')) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = F)
```

| Genre Name | Count |
|---|---|
| Japanese style | 1750 |
| International cuisine | 700 |
| Creation | 410 |
| Seafood | 339 |
| Grilled meat | 325 |
| Italian | 249 |
| Spain Bar/Italian Bar | 229 |
| Chinese general | 91 |
| Japanese food in general | 85 |
| Japanese cuisine/Kaiseki | 64 |
| Creative Japanese food | 60 |
| Karaoke | 60 |
| Shabu-shabu/Sukiyaki | 59 |
| Okonomiyaki/Monja/Teppanyaki | 44 |
| Party | 40 |
| Korean cuisine | 38 |
| French | 27 |
| Steak/Hamburger/Curry | 24 |
| Bistro | 22 |
| Cafe | 16 |
| Sushi | 11 |
| Pasta/Pizza | 10 |
| Bar/Cocktail | 7 |
| Amusement bar | 5 |
| Thai/Vietnamese food | 5 |
| Western food | 5 |
| Cantonese food | 4 |
| Sichuan food | 3 |
| Dim Sum/Dumplings | 2 |
| Sweets | 2 |
| Shanghai food | 1 |
| Spain/Mediterranean cuisine | 1 |
| Taiwanese/Hong Kong cuisine | 1 |
| Udon/Soba | 1 |
Using the hpg-air relational dataframe, we will use the hpg reservation information to help predict visits to the air stores, which are the ones that require forecasts for the submission.
Let's explore how many visitors reserved ahead of their visit. All of the following graphs group all of the stores together to show overall trends in the data.
```r
# What days are our reservations made for?
air_reserve %>%
  group_by(Week = format(as.Date(visit_datetime), "%Y-%W-%m")) %>%
  summarise(Visitors = sum(reserve_visitors)) %>%
  ggplot(aes(x = Week, y = Visitors, group = 1)) + geom_line() +
  theme(axis.text.x = element_text(angle = 90)) +
  labs(title = 'air Expected Visits by Week')
```

We can see a big spike in the weeks leading up to the New Year, with further peaks afterwards. Our forecasts will cover visits from the last week of April through the end of May. We already have some reservations for this period (which includes "Golden Week" in Japan), as can be seen above, and we can contrast this with when those reservations were actually made below.
```r
# When are our reservations made?
air_reserve %>%
  group_by(Week = format(as.Date(reserve_datetime), "%Y-%W-%m")) %>%
  summarise(Visitors = sum(reserve_visitors)) %>%
  ggplot(aes(x = Week, y = Visitors, group = 1)) + geom_line() +
  theme(axis.text.x = element_text(angle = 90)) +
  labs(title = 'air Reservations Made by Week')
```

So while a lot of our visits will take place during Golden Week, we were only given some of the reservations made for our forecast period.
Let's take a look at the hpg data.
```r
hpg_reserve %>%
  group_by(Week = format(as.Date(visit_datetime), "%Y-%W-%m")) %>%
  summarise(Visitors = sum(reserve_visitors)) %>%
  ggplot(aes(x = Week, y = Visitors, group = 1)) + geom_line() +
  theme(axis.text.x = element_text(angle = 90)) +
  labs(title = 'hpg Expected Visits by Week')
```

```r
hpg_reserve %>%
  group_by(Week = format(as.Date(reserve_datetime), "%Y-%W-%m")) %>%
  summarise(Visitors = sum(reserve_visitors)) %>%
  ggplot(aes(x = Week, y = Visitors, group = 1)) + geom_line() +
  theme(axis.text.x = element_text(angle = 90)) +
  labs(title = 'hpg Reservations Made by Week')
```

A similar story to the air reservation data.
Let's see how many actual visitors we have across the different stores.
```r
# How many visitors have we actually had?
air_visit_data %>%
  group_by(Week = format(as.Date(visit_date), "%Y-%W-%m")) %>%
  summarise(Visitors = sum(visitors)) %>%
  ggplot(aes(x = Week, y = Visitors, group = 1)) + geom_line() +
  theme(axis.text.x = element_text(angle = 90)) +
  labs(title = 'Actual Visitors')
```

This includes visits to places that are not reservation based, so we will have to sift those out by merging the store info and actual visitor dataframes. Otherwise, we can see that the weekly peaks remain around 100k actual visitors.
```r
# I can add the store info to the air visit data
air_visit_info <- merge(air_visit_data, air_store_info, by = 'air_store_id')
rm(air_visit_data)

air_visit_info %>%
  group_by(Week = format(as.Date(visit_date), "%Y-%W-%m"), Genre = air_genre_name) %>%
  summarise(Visitors = sum(visitors)) %>%
  ggplot(aes(x = Week, y = Visitors, color = Genre, group = Genre)) + geom_line() +
  scale_color_manual(values = colorRampPalette(brewer.pal(8, "Accent"))(14)) +
  theme(axis.text.x = element_text(angle = 90)) +
  labs(title = 'Actual Visitors')
```

This gives us a better breakdown of actual visitors. The most popular genre is Izakaya, which we anticipated from the genre counts, followed by Cafe/Sweets, Italian/French, and then Dining bar. I imagine dining bars and Italian/French places do take reservations, so the stores with reservation data will probably sit on the smaller side of visitor traffic.
This is without the hpg merger, however.
First I will clean up the air_reserve dataset. I want the total number of people expected to visit each store on a given date, which means aggregating the data and keeping just the date portion of the datetime.
```r
# I just care about expected visitors on a certain day, not the time nor necessarily the reservation information
air_reserve <- air_reserve %>%
  mutate(visit_day = as.Date(visit_datetime)) %>%
  select(-c(reserve_datetime, visit_datetime))
```

Next I clean up the hpg_reserve dataset to prepare it for a merger with the air_reserve data.
```r
# At least for the first go-around in my predictions, I just want information related to the air stores.
# First I match up the hpg id with the air id
row.names(store_id_relation) <- store_id_relation[, 2]
hpg_reserve$air_store_id <- store_id_relation[hpg_reserve$hpg_store_id, 1]

# Next I keep just the observations with the air stores
hpg_reserve <- hpg_reserve %>%
  drop_na() %>%
  mutate(visit_day = as.Date(visit_datetime)) %>%
  select(-c(reserve_datetime, visit_datetime, hpg_store_id))

# I want the columns to match up to bind properly, and I want to change the column order
air_reserve <- air_reserve[, c(1, 3, 2)]
hpg_reserve <- hpg_reserve[, c(2, 3, 1)]
air_reserve <- rbind(air_reserve, hpg_reserve)
rm(hpg_reserve)

# Put the reservation and store information into one single dataframe
air_reserve_info <- merge(air_reserve, air_store_info, by = 'air_store_id')
rm(air_reserve)

# I want one aggregate visitor amount per visiting day
air_reserve_info <- aggregate(reserve_visitors ~ air_store_id + visit_day +
                                air_genre_name + air_area_name +
                                latitude + longitude,
                              air_reserve_info, FUN = sum)
```

Now that I have the reserve data, I want to add the actual visitor data to it and make one large dataset.
```r
# First I aggregate the visitor info
air_visit_info <- aggregate(visitors ~ air_store_id + visit_date +
                              air_genre_name + air_area_name +
                              latitude + longitude,
                            air_visit_info, FUN = sum)
air_visit_info$visit_day <- air_visit_info$visit_date
air_full <- merge(air_reserve_info, air_visit_info[, c(1, 7, 8)], by = c('air_store_id', 'visit_day'))
rm(air_visit_info, air_reserve_info)
```

Before I look at the reservation and actual visitation patterns, I will first add in the holiday data.
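A minimal sketch of what that merge could look like, assuming `date_info` has been read from the competition's `date_info.csv` (columns `calendar_date`, `day_of_week`, and `holiday_flg`):

```r
# Sketch (assumed): attach calendar information to the combined dataframe.
# date_info is assumed to come from date_info.csv in the competition files.
date_info$visit_day <- as.Date(date_info$calendar_date)
air_full <- merge(air_full,
                  date_info[, c('visit_day', 'day_of_week', 'holiday_flg')],
                  by = 'visit_day', all.x = TRUE)
```

Among other things, this supplies the `day_of_week` column used in the plots that follow.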
Let's see how well the reservation data matches the visitor data.

```r
air_full$pct_reserve <- air_full$reserve_visitors / air_full$visitors

air_full %>%
  ggplot(aes(factor(air_genre_name), pct_reserve)) + geom_boxplot() +
  theme(axis.text.x = element_text(angle = 90)) + coord_flip() +
  labs(x = 'Genre', y = 'Percent Reserve',
       title = 'Boxplot of Percent of Visitors that Reserved vs Genre')
```

As we can see, there are some serious outliers in the data. Let's trim the data down and see if we can establish a relationship then. As it stands, it seems there are a lot more reservations for the air places of interest than there are actual visitors.
```r
air_full %>%
  group_by(air_genre_name) %>%
  filter(!(abs(pct_reserve - median(pct_reserve)) > 2 * sd(pct_reserve))) %>%
  ggplot(aes(factor(air_genre_name), pct_reserve)) + geom_violin(trim = T, scale = T) +
  geom_boxplot(width = 0.15) +
  theme(axis.text.x = element_text(angle = 90)) + coord_flip() +
  labs(x = 'Genre', y = 'Percent Reserve',
       title = 'Violin and Boxplot of Percent of Visitors that Reserved\nvs Genre, Trimmed')
```

This looks much better. We can still see quite a few places that get more reservations than visitors: places like cafes, bars, and dining bars. I would have thought it would be the other way around, with reservations making up a small percentage of total visitors for the day. This might be because these places fill up quickly, or maybe the system takes multiple reservations; I'm not entirely sure. I've also used violin plots to show that the median percent reserved usually falls within a normal figure, i.e., below 100%, while the mean can be pulled one way or the other.
Let's also take a look at the breakdown of our data by day of the week. The graph below already trims some of the outliers.
```r
air_full %>%
  group_by(day_of_week) %>%
  filter(!(abs(pct_reserve - median(pct_reserve)) > 2 * sd(pct_reserve))) %>%
  ggplot(aes(factor(day_of_week), pct_reserve)) + geom_violin(trim = T, scale = T) +
  geom_boxplot(width = 0.15) +
  labs(x = 'Day of the Week', y = 'Percent Reserve',
       title = 'Violin and Boxplot of Percent of Visitors that Reserved\nper Day of the Week')
```

The plots show that the means are pretty similar across most days of the week. We can see that the medians are lower for Saturday and Sunday, but in general all of these guitar-looking violins reveal that, modally, most places have more visitors than reservations throughout the week.
When it comes to modeling, the ask is going to be a little tough, because the submission requires forecasting visitors for every store for 39 days. The full dataset I am working with is a panel, so forecasting ahead of the panel will be somewhat tricky. For instance, I would need to look at an ACF for each of the many store IDs to see which stores show autocorrelation and which do not.
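One way that per-store check could be sketched is a Ljung-Box test on each store's visitor series; this assumes the combined `air_full` dataframe built above, and is an illustration rather than something run in the original analysis:

```r
# Hedged sketch: Ljung-Box test of each store's visitor series.
# Small p-values indicate significant autocorrelation up to lag 7 (one week).
library(dplyr)

lb_by_store <- air_full %>%
  group_by(air_store_id) %>%
  filter(n() > 14) %>%   # require at least a couple of weeks of observations
  summarise(p_value = Box.test(visitors, lag = 7, type = 'Ljung-Box')$p.value)

# Share of stores with detectable weekly autocorrelation
mean(lb_by_store$p_value < 0.05)
```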
Additionally, for any linear models I will have to use a Box-Cox transformation, since the data is non-normal (the number of visitors cannot be negative).
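A minimal sketch of that transformation with the forecast package, assuming `visits.ts` is a hypothetical weekly-seasonal series of visitor counts:

```r
# Sketch (assumed series name): estimate the Box-Cox lambda and pass it to the
# model, so forecasts are back-transformed automatically and stay non-negative.
library(forecast)

lambda <- BoxCox.lambda(visits.ts)
fit <- ets(visits.ts, lambda = lambda)
fc <- forecast(fit, h = 39)
```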
What's my approach? Well, since I did all of the hard work to get here, I might as well try to forecast the overall level of visitors and then use the mean of that as my prediction. Otherwise, I would have gone with individual predictions for each store ID, and I could try that using a fixed-effects panel.
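For reference, the fixed-effects alternative could be sketched with the plm package; the regressors here (`day_of_week`, `holiday_flg`) are assumptions, taken as columns that would come from the holiday merge:

```r
# Hedged sketch of a fixed-effects panel on the combined dataframe.
# The regressor columns are assumed, not part of the original model.
library(plm)

fe <- plm(visitors ~ reserve_visitors + day_of_week + holiday_flg,
          data = air_full,
          index = c('air_store_id', 'visit_day'),
          model = 'within')   # "within" = store fixed effects
summary(fe)
```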
I will break the visitor data down into 5 genres: Izakaya, Cafe/Sweets, Dining bar, Italian/French, and all others. Although I want to use the reservations to help with my predictions, I think it might be better to go without, since I do not have a separate set of predictors for reservations and for visitors. This means the predictors I use to model the number of reservations would also be used to model the number of visitors, and that two-step regression might not do so well.
First I want to compare the overall Izakaya visits with the aggregate visits we would observe for the sample submission stores.
```r
izak <- air_full[air_full$air_genre_name == 'Izakaya',]
izak.ts <- aggregate(visitors ~ visit_day, FUN = sum, data = izak)

autoplot(ts(izak.ts$visitors)) +
  labs(title = 'All Data Izakaya',
       x = 'Time',
       y = 'Visitors')
```

```r
izak.sample <- izak[which(izak$air_store_id %in% forecast_stores),]
izak.sample.ts <- aggregate(visitors ~ visit_day, FUN = sum, data = izak.sample)

autoplot(ts(izak.sample.ts$visitors)) +
  labs(title = 'Sample Submission Izakaya',
       x = 'Time',
       y = 'Visitors')
```

To forecast this time series, I will try ETS and Fourier methods on the sample-submission aggregates after creating a train/test split.
```r
# The frequency is set to 7 to represent the weekly fluctuations
izak.train <- ts(izak.sample.ts[1:round(nrow(izak.sample.ts) * 0.90), "visitors"],
                 start = 1, frequency = 7)
izak.test <- ts(izak.sample.ts[round(nrow(izak.sample.ts) * 0.9):nrow(izak.sample.ts), "visitors"],
                start = end(izak.train), frequency = 7)
```

I used a 90/10 split because it produces a similar number of periods to the ones we have to forecast: we must forecast 39 days out, and the test sample has 49 days.
First I will take a look at the ETS model. The time series is given a seasonal frequency of 7, since the dataset has a weekly pattern.
```r
izak.ets <- ets(izak.train, allow.multiplicative.trend = T, model = "MAM")
izak.ets.forecast <- forecast(izak.ets, h = 49)

autoplot(izak.ets.forecast, PI = F) +
  labs(y = 'Visitors')

accuracy(izak.ets.forecast, x = izak.test) %>%
  knitr::kable(caption = 'ETS Accuracy Output') %>%
  kableExtra::kable_styling(bootstrap_options = "striped", full_width = F)
```

| | ME | RMSE | MAE | MPE | MAPE | MASE | ACF1 | Theil’s U |
|---|---|---|---|---|---|---|---|---|
| Training set | -34.671107 | 372.9731 | 186.0986 | -22.798045 | 37.36347 | 0.8521535 | 0.2984924 | NA |
| Test set | 9.726515 | 332.6870 | 276.9076 | -3.166493 | 19.97433 | 1.2679716 | 0.6495821 | 0.4546517 |
The ETS provides us with a baseline. We have a model with multiplicative errors, additive trend, and multiplicative seasonality.
Although the Holt-Winters model is technically cycled through when ets() is called, it might still perform well on the test set, so it is worth taking a look.

```r
# Fit the Holt-Winters model; the multiplicative seasonal specification is assumed
izak.hw <- hw(izak.train, seasonal = "multiplicative", h = 49)

accuracy(izak.hw, x = izak.test) %>%
  knitr::kable(caption = 'Holt-Winters Accuracy Output') %>%
  kableExtra::kable_styling(bootstrap_options = "striped", full_width = F)
```

| | ME | RMSE | MAE | MPE | MAPE | MASE | ACF1 | Theil’s U |
|---|---|---|---|---|---|---|---|---|
| Training set | -40.21141 | 346.0954 | 199.7395 | -24.0191161 | 40.84157 | 0.9146155 | 0.3098625 | NA |
| Test set | 99.05261 | 398.4692 | 314.1756 | 0.8692316 | 20.50341 | 1.4386234 | 0.3577733 | 0.5193806 |
It does not perform better than ets() on the test metrics, so it will not be considered.
Hyndman recommends trying a Fourier model. I first re-seasonalize the sample data so that it represents a daily time series with annual frequency.
```r
izak.train <- ts(izak.sample.ts[1:round(nrow(izak.sample.ts) * 0.90), "visitors"], frequency = 365)
izak.test <- ts(izak.sample.ts[round(nrow(izak.sample.ts) * 0.9):nrow(izak.sample.ts), "visitors"],
                start = end(izak.train), frequency = 365)

# Cycle through different models and select the best fit by AIC
min.aic <- Inf
best <- 0
for (k in 1:182) {
  fit <- tslm(izak.train ~ poly(trend, 2) + fourier(izak.train, K = k))
  aic <- as.numeric(CV(fit)[2])   # CV() returns CV, AIC, AICc, BIC, AdjR2
  if (aic < min.aic) {
    min.aic <- aic
    best <- k
  }
}

# Plot the bestfit model
bestfit <- tslm(izak.train ~ trend + fourier(izak.train, K = best))
autoplot(izak.train,
         main = 'Fourier on Training Data',
         ylab = 'Visitors') +
  autolayer(bestfit$fitted.values,
            series = as.character(best), color = 'red')
```

The Fourier model seems to do really well on the training data; let's see the forecasts and the test results.
```r
fourier.fc <- forecast(bestfit,
                       newdata = data.frame(fourier(izak.train, K = best, h = 49)))

autoplot(fourier.fc, PI = F,
         main = 'Fourier Forecasts',
         ylab = 'Visitors')
```

The forecasts don't end up looking too good, at least compared with ets() or Holt-Winters hw().
```r
accuracy(fourier.fc, izak.test) %>%
  knitr::kable(caption = 'Fourier Accuracy Output') %>%
  kableExtra::kable_styling(bootstrap_options = "striped", full_width = F)
```

| | ME | RMSE | MAE | MPE | MAPE | MASE | ACF1 | Theil’s U |
|---|---|---|---|---|---|---|---|---|
| Training set | 0.0000 | 204.1248 | 138.6950 | -3.525013 | 29.26996 | 0.1518619 | -0.1607332 | NA |
| Test set | 182.3367 | 693.2873 | 522.2779 | -4.863577 | 33.04589 | 0.5718599 | 0.2864045 | 0.9331468 |
The Fourier model has the best MASE of the three models I tried. I would have gone with it if I anticipated the out-of-sample data looking more like the sample data, but I have a strong suspicion that this is not the case. Remember that we are expecting Golden Week, so the multiplicative nature of the data will likely continue, and the Fourier model doesn't capture that. Thus, going by RMSE, I will use ets() for the Izakaya forecasts.
```r
sample_submission$air_store_id <- str_sub(sample_submission$id, 1, 20)
sample_submission$date <- str_sub(sample_submission$id, 22, 31)
sample_submission <- merge(sample_submission, air_store_info[, c(1, 2)],
                           by = c('air_store_id'), all.x = T, all.y = F)
```

```r
cafe <- air_full[air_full$air_genre_name == 'Cafe/Sweets',]
cafe.ts <- aggregate(visitors ~ visit_day, FUN = sum, data = cafe)

autoplot(ts(cafe.ts$visitors)) +
  labs(title = 'All Data Cafe/Sweets',
       x = 'Time',
       y = 'Visitors')
```

```r
cafe.sample <- cafe[which(cafe$air_store_id %in% forecast_stores),]
cafe.sample.ts <- aggregate(visitors ~ visit_day, FUN = sum, data = cafe.sample)

autoplot(ts(cafe.sample.ts$visitors)) +
  labs(title = 'Sample Submission Cafe/Sweets',
       x = 'Time',
       y = 'Visitors')
```

I will also use the cafe sample data, since it tracks the overall data pretty well.
```r
# The frequency is set to 7 to represent the weekly fluctuations
cafe.train <- ts(cafe.sample.ts[1:round(nrow(cafe.sample.ts) * 0.90), "visitors"],
                 start = 1, frequency = 7)
cafe.test <- ts(cafe.sample.ts[round(nrow(cafe.sample.ts) * 0.9):nrow(cafe.sample.ts), "visitors"],
                start = end(cafe.train), frequency = 7)
```

Let's use ETS again for the Cafe/Sweets data.
```r
cafe.ets <- ets(cafe.train, allow.multiplicative.trend = T, model = "MAM")
cafe.ets.forecast <- forecast(cafe.ets, h = 45)

autoplot(cafe.ets.forecast, PI = F) +
  labs(y = 'Visitors')

accuracy(cafe.ets.forecast, x = cafe.test) %>%
  knitr::kable(caption = 'ETS Accuracy Output') %>%
  kableExtra::kable_styling(bootstrap_options = "striped", full_width = F)
```

| | ME | RMSE | MAE | MPE | MAPE | MASE | ACF1 | Theil’s U |
|---|---|---|---|---|---|---|---|---|
| Training set | -8.151733 | 71.15894 | 49.84128 | -132.75550 | 157.25948 | 1.038857 | 0.1480382 | NA |
| Test set | 22.652709 | 116.42641 | 87.99563 | -15.24103 | 45.12135 | 1.834119 | 0.4771851 | 0.9360908 |
We have our ETS baseline.
This time I will not consider Holt-Winters, and will jump straight into the Fourier model.
```r
cafe.train <- ts(cafe.sample.ts[1:round(nrow(cafe.sample.ts) * 0.90), "visitors"], frequency = 365)
cafe.test <- ts(cafe.sample.ts[round(nrow(cafe.sample.ts) * 0.9):nrow(cafe.sample.ts), "visitors"],
                start = end(cafe.train), frequency = 365)

# Cycle through different models and select the best fit by AIC
min.aic <- Inf
best <- 0
for (k in 1:182) {
  fit <- tslm(cafe.train ~ poly(trend, 2) + fourier(cafe.train, K = k))
  aic <- as.numeric(CV(fit)[2])
  if (aic < min.aic) {
    min.aic <- aic
    best <- k
  }
}

# Plot the bestfit model
bestfit <- tslm(cafe.train ~ trend + fourier(cafe.train, K = best))
autoplot(cafe.train,
         main = 'Fourier on Training Data',
         ylab = 'Visitors') +
  autolayer(bestfit$fitted.values,
            series = as.character(best), color = 'red')
```

The Fourier model again seems to do really well on the training data; let's see the forecasts and the test results.
```r
fourier.fc <- forecast(bestfit,
                       newdata = data.frame(fourier(cafe.train, K = best, h = 45)))

autoplot(fourier.fc, PI = F,
         main = 'Fourier Forecasts',
         ylab = 'Visitors')

accuracy(fourier.fc, cafe.test) %>%
  knitr::kable(caption = 'Fourier Accuracy Output') %>%
  kableExtra::kable_styling(bootstrap_options = "striped", full_width = F)
```

| | ME | RMSE | MAE | MPE | MAPE | MASE | ACF1 | Theil’s U |
|---|---|---|---|---|---|---|---|---|
| Training set | 0.00000 | 44.98388 | 29.07108 | -26.58356 | 68.39328 | 0.1556562 | 0.0783414 | NA |
| Test set | 29.74226 | 189.85987 | 152.43916 | -46.84700 | 85.92032 | 0.8162097 | 0.3660412 | 1.526767 |
We get a similar story with our cafe/sweets predictions.
```r
dining <- air_full[air_full$air_genre_name == 'Dining bar',]
dining.ts <- aggregate(visitors ~ visit_day, FUN = sum, data = dining)

autoplot(ts(dining.ts$visitors)) +
  labs(title = 'All Data Dining Bar',
       x = 'Time',
       y = 'Visitors')
```

```r
dining.sample <- dining[which(dining$air_store_id %in% forecast_stores),]
dining.sample.ts <- aggregate(visitors ~ visit_day, FUN = sum, data = dining.sample)

autoplot(ts(dining.sample.ts$visitors)) +
  labs(title = 'Sample Submission Dining Bar',
       x = 'Time',
       y = 'Visitors')
```

I will also use the dining sample data, since it tracks the overall data pretty well.
```r
# The frequency is set to 7 to represent the weekly fluctuations
dining.train <- ts(dining.sample.ts[1:round(nrow(dining.sample.ts) * 0.90), "visitors"],
                   start = 1, frequency = 7)
dining.test <- ts(dining.sample.ts[round(nrow(dining.sample.ts) * 0.9):nrow(dining.sample.ts), "visitors"],
                  start = end(dining.train), frequency = 7)
```

This time I will just stick to ETS as my forecasting method of choice.
```r
dining.ets <- ets(dining.train, allow.multiplicative.trend = T, model = "MAM")
dining.ets.forecast <- forecast(dining.ets, h = 45)

autoplot(dining.ets.forecast, PI = F) +
  labs(y = 'Visitors')

accuracy(dining.ets.forecast, x = dining.test) %>%
  knitr::kable(caption = 'ETS Accuracy Output') %>%
  kableExtra::kable_styling(bootstrap_options = "striped", full_width = F)
```

| | ME | RMSE | MAE | MPE | MAPE | MASE | ACF1 | Theil’s U |
|---|---|---|---|---|---|---|---|---|
| Training set | -14.87462 | 155.4840 | 91.01057 | -67.066315 | 89.63027 | 0.8978197 | 0.3821467 | NA |
| Test set | 18.53166 | 202.2967 | 152.40547 | -4.032356 | 29.45463 | 1.5034808 | 0.3749917 | 0.7581085 |
```r
# Fit on the full sample series and spread the forecast across the dining bar stores
dining.ets.full <- ets(ts(dining.sample.ts["visitors"], start = 1, frequency = 7))
dining.fc <- forecast(dining.ets.full, h = 39)
dining.forecast <- data.frame(date = sample_submission$date[1:39],
                              fc = as.numeric(dining.fc$upper[, 2]) /
                                nrow(air_store_info[air_store_info$air_genre_name == 'Dining bar',]))
```

```r
itfr <- air_full[air_full$air_genre_name == 'Italian/French',]
itfr.ts <- aggregate(visitors ~ visit_day, FUN = sum, data = itfr)

autoplot(ts(itfr.ts$visitors)) +
  labs(title = 'All Data Italian/French',
       x = 'Time',
       y = 'Visitors')
```

```r
itfr.sample <- itfr[which(itfr$air_store_id %in% forecast_stores),]
itfr.sample.ts <- aggregate(visitors ~ visit_day, FUN = sum, data = itfr.sample)

autoplot(ts(itfr.sample.ts$visitors)) +
  labs(title = 'Sample Submission Italian/French',
       x = 'Time',
       y = 'Visitors')
```

I will also use the Italian/French sample data, since it tracks the overall data pretty well.
```r
# The frequency is set to 7 to represent the weekly fluctuations
itfr.train <- ts(itfr.sample.ts[1:round(nrow(itfr.sample.ts) * 0.90), "visitors"],
                 start = 1, frequency = 7)
itfr.test <- ts(itfr.sample.ts[round(nrow(itfr.sample.ts) * 0.9):nrow(itfr.sample.ts), "visitors"],
                start = end(itfr.train), frequency = 7)
```

Again I will stick to ETS as my forecasting model of choice.
```r
itfr.ets <- ets(itfr.train, allow.multiplicative.trend = T, model = "MAM")
itfr.ets.forecast <- forecast(itfr.ets, h = 49)

autoplot(itfr.ets.forecast, PI = F) +
  labs(y = 'Visitors')

accuracy(itfr.ets.forecast, x = itfr.test) %>%
  knitr::kable(caption = 'ETS Accuracy Output') %>%
  kableExtra::kable_styling(bootstrap_options = "striped", full_width = F)
```

| | ME | RMSE | MAE | MPE | MAPE | MASE | ACF1 | Theil’s U |
|---|---|---|---|---|---|---|---|---|
| Training set | -3.471546 | 145.9475 | 78.20574 | -25.298070 | 45.03443 | 0.7787832 | 0.3850765 | NA |
| Test set | 82.516461 | 147.3630 | 124.20539 | 8.809987 | 16.34202 | 1.2368538 | 0.2630564 | 0.5240515 |
This model seems to fit well: the RMSEs of the training and test sets are very similar, so I would expect the prediction to be robust.
I repeat the same exercise for all other stores, again sticking with ETS as my forecasting model of choice.
```r
other_stores <- unique(air_store_info$air_genre_name)[-c(1, 2, 4, 5)]
for (store in other_stores) {
  other <- air_full[air_full$air_genre_name == store,]
  other.ts <- aggregate(visitors ~ visit_day, FUN = sum, data = other)
  other.ets.full <- ets(ts(other.ts["visitors"], start = 1, frequency = 7))
  other.fc <- forecast(other.ets.full, h = 39)
  other.forecast <- data.frame(date = sample_submission$date[1:39],
                               fc = as.numeric(other.fc$upper[, 2]) /
                                 nrow(air_store_info[air_store_info$air_genre_name == store,]))
  for (i in 1:nrow(sample_submission)) {
    if (sample_submission[i, 5] == store) {
      sample_submission[i, 'visitors'] = other.forecast[other.forecast$date == sample_submission[i, 4], 2]
    }
  }
}
```

The per-store version below will take a long time to run, so it is commented out by default; it can be run by removing the surrounding quotes.
```r
'for (i in unique(sample_submission$air_store_id)) {
  single <- air_visit_data[air_visit_data$air_store_id == i,]
  visit <- ts(single$visitors, frequency = 7)
  single.ets <- ets(visit)
  single.fc <- forecast(single.ets, h = 39)
  single.forecast <- data.frame(date = as.numeric(as.Date(sample_submission$date[1:39])),
                                fc = as.numeric(single.fc$upper[,2]))
  for (j in 1:nrow(sample_submission)) {
    if (sample_submission[j, 1] == i) {
      sample_submission[j, 3] <- single.forecast[single.forecast$date == as.numeric(as.Date(sample_submission[j, 4])), 2]
    }
  }
}'
```
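As an aside, the nested loops above could be replaced by a single merge on store id and date. A hedged sketch, assuming a hypothetical `all_forecasts` dataframe holding one forecast row per (`air_store_id`, `date`) pair:

```r
# Sketch (assumed all_forecasts with columns air_store_id, date, fc):
# one merge fills every submission row at once instead of looping.
sample_submission <- merge(sample_submission,
                           all_forecasts,
                           by = c('air_store_id', 'date'),
                           all.x = TRUE)
sample_submission$visitors <- sample_submission$fc
sample_submission$fc <- NULL
```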